-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a faster string-search method with simd instructions than std::find #10858
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
e99e699
to
457860e
Compare
@Yuhta from #10731
I found that the performance of the code will be reduced by 30% when placed in the cpp file. So for the case, I change the code like this :
I haven't found a good way to handle this situation, could you give me some advice?
RedHat define the valid values:
|
991be1a
to
54422e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider the other suggestions in #10731 (review)
54422e3
to
2cf06d7
Compare
Updates code for the suggestions in #10731:
done
done
I have extended StringBenchmark to support different algrithm, and kmp is always slower than simdStrStr(maybe my implement is not good so I add std::boyer_moore_searcher for comparison):
I update ut refer to folly uts, add random test cases. Could you have look at this once more? |
@Yuhta ping~ |
58a2c79
to
3ae8be1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark result looks very good
3ae8be1
to
13220d3
Compare
@Yuhta update code
As we inline kPageSize, pageSafe method is faster than before, I move the check to each batch iteration, we only miss the comparison-span-pages case, maybe better, like clickhouse:
new benchmark result:
|
13220d3
to
40a8df4
Compare
0e2a09d
to
b6c6706
Compare
@Yuhta ping~ |
velox/common/base/SimdUtil-inl.h
Outdated
// may not generate a general-protection exception (#GP) in this situation, | ||
// and the address that spans the end of the segment may or may not wrap | ||
// around to the beginning of the segment. | ||
for (; i <= n - needleSize && pageSafe(s + i) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, we will drop out of this fast path as soon as [s + i, s + i + needleSize) across the page boundary? So data in the subsequent pages would still go through slow path, even they are not crossing page boundary anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So data in the subsequent pages would still go through slow path, even they are not crossing page boundary anymore
yes , your understand correctly, this way maybe better than that put check outside because we could still check part of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think we can still use the fast path for the second, and any subsequent pages? Why do we need to drop out early?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let`s detail the case here:
ref : Computer Systems: A Programmer's Perspective section 9.3
Four virtual pages (VP 1, VP 2, VP 4, and VP 7)
are currently cached in DRAM. Two pages (VP 0 and VP 5) have not yet been
allocated, and the rest (VP 3 and VP 6) have been allocated, but are not currently
cached.
I think we only need to avoid to load data at the end of VP 4 because VP 5 is not allocated yet, so we only need to check if the end of input string array is pageSafe
Will we miss the opportunity if it is a super long string that is spanning multiple pages?
And we could only check the lst pointer in substring-search:
if (i + needleSize + CharVector::size > n &&
!pageSafe(s + i + needleSize - 1)) {
break;
}
I think we only miss last page in the case that the end of input string is right at the end of VP4 by this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta update comment and code .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's very clear now, thanks
Any more comments about this ? |
7aa4d26
to
446f09d
Compare
velox/common/base/SimdUtil-inl.h
Outdated
// Assume that the input string is allocated on virtual pages : VP1, VP2, | ||
// VP3 and VP4 has not been allocated yet, we need to check the end of input | ||
// string is page-safe to over-read CharVector. | ||
if (i + needleSize + CharVector::size > n && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition can be loosen to i + needleSize - 1 + CharVector::size > n
. Maybe this is more clear:
const auto last = i + needleSize - 1;
if (last + CharVector::size > n && !pageSafe(s + last)) {
break;
}
// ...
auto blockLast = CharVector::load_unaligned(s + last);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Description
In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation.
Solution
There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5].
This PR refers to the implementation of [5], [6], [8] :
[5] doris : first-two comapre first, and the issue of page-safe was also mentioned
[6] stringzilla: first-mid-lst compare first
[8] simd-strstr implement by WojciechMula, first-lst comapre first.
They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost of 'optimized path' itself.
In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars.
ut & benchmark
This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation.
Benchmark : std::find vs simdStrStr:
findSuccessful(opt_50_5) 31.66ms 31.59
findSuccessful(opt_100_10) 36.81ms 27.16
findSuccessful(opt_100_20) 96.47ms 10.37
findSuccessful(opt_1k_10) 63.27ms 15.81
findSuccessful(opt_1k_100) 156.95ms 6.37
findSuccessful(std_50_5) 45.24ms 22.11
findSuccessful(std_100_10) 191.23ms 5.23
findSuccessful(std_100_20) 322.11ms 3.10
findSuccessful(std_1k_10) 122.69ms 8.15
findSuccessful(std_1k_100) 168.10ms 5.95
findUnsuccessful(std_first_char_match) 819.13ms 1.22
findUnsuccessful(opt_first_char_match) 261.46ms 3.82
findUnsuccessful(std_first_char_unmatch) 249.12ms 4.01
findUnsuccessful(opt_first_char_unmatch) 199.02ms 5.02
References
[1] gcc std::search issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414
[2] glibc strstr implements:
https://github.com/lattera/glibc/blob/master/string/strstr.c
[3] two way search:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
[4] std::boyer_moore_searcher
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher
https://github.com/mclow/search-library?tab=readme-ov-file
[5] doris string-search
https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44
[6] stringzilla
https://github.com/ashvardanian/StringZilla/blob/main/include/stringzilla/stringzilla.h#L2187
[7] string-search algorithms
https://www-igm.univ-mlv.fr/~lecroq/string/
[8] sse4 strst
https://github.com/WojciechMula/sse4-strstr
[9] folly benchmark
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTestBenchmarks.cpp.h
[10] folly ut
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTest.cpp#L1277
[11] clickhouse Volnitsky
https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/Volnitsky.h#L482